Use
filterto select observations (rows) from a tibble.
In the previous section we learned how to select variables (columns) with all its values using the select function. In this section we are interested in a collection of observations (rows). For this we have the function filter which takes as its first argument the tibble followed by a logical condition describing the variable values constraints.
In the context of filter function a logical condition is a way to express whether some statement about an observation is true or false, if true then that observation is returned otherwise it is ignored. Note that the condition is applied to all observation in the tibble.
For example, let’s say we want to know which individuals in our pulse dataset are taller than 190 cm? You can achieve this with filter as follows:
filter(pulse, height>190) # filter all observations with height greater than 190 cm
# A tibble: 5 × 13
id name height weight age gender smokes alcohol exerc…¹ ran pulse1 pulse2
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1993_D Travis 195 84 18 male no yes high sat 71 73
2 1997_K John 191 78 19 male no yes high ran 68 136
3 1997_O Albert 194 110 25 male no no modera… sat 75 75
4 1997_Q Lance 192 105 21 male no no modera… sat 80 73
5 1997_R Chris… 194 95 18 male no yes modera… ran 84 140
# … with 1 more variable: year <dbl>, and abbreviated variable name ¹exercise
or another example would be, who’s weight is exactly 55 kg?
filter(pulse, weight==55) # filter all observations with weight equal to 55 kg
# A tibble: 5 × 13
id name height weight age gender smokes alcohol exerc…¹ ran pulse1 pulse2
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1993_H Franc… 169 55 18 female no yes modera… sat 71 77
2 1993_W Glenna 163 55 20 female no no low sat 78 74
3 1995_J Gwen 163 55 20 female no no modera… ran 70 119
4 1998_G Ursula 155 55 20 female no yes high sat 82 87
5 1998_I Elean… 168 55 24 female no no modera… sat 74 70
# … with 1 more variable: year <dbl>, and abbreviated variable name ¹exercise
filter uses logical vectors in the background to achieve the result. You’ve already seen an example of a logical vector in the section on Vectors. The result of a comparison with relational operators {==, !=, <, <=, >, >=} is a logical vector.
‘==’ stands for ‘equal to’ and ‘!=’ is ‘not equal to’
In the examples above the comparisons height>190 and weight==55 are in fact logical vectors with the same length as the number of observations (rows) in the pulse dataset. The function filter takes the logical vector and returns only the rows in positions for which the value is TRUE.
In the previous examples with filter we only had a single condition. We can expand the conditions to ask more complex questions. For example which males weigh less than 70? Here we have two conditions:
gender=="male" : male observationsweight<70 : weigh less than 70And : ‘&’
Both these conditions must be met, with other words for each observation gender=="male" and weight<70 must be true and this is expressed with the symbol & :
filter(pulse, gender=="male" & weight<70)
# A tibble: 24 × 13
id name height weight age gender smokes alcohol exerc…¹ ran pulse1 pulse2
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1993_J Troy 168 60 23 male no yes modera… ran 88 150
2 1993_L Fred… 178 58 19 male no no low sat 74 76
3 1993_M Just… 170 68 22 male yes yes modera… sat 70 71
4 1993_N Erne… 187 59 18 male no yes high sat 78 82
5 1993_Q Lesl… 170 56 19 male no no low sat 64 63
6 1993_U Jero… 175 60 19 male no no low sat 88 86
7 1993_Y John 176 59 19 male no no modera… sat 68 69
8 1995_A Pedro 170 60 18 male no yes modera… sat 62 59
9 1995_C Oscar 189 60 19 male no yes modera… ran 78 168
10 1995_I Marc 175 65 19 male no yes modera… ran 60 104
# … with 14 more rows, 1 more variable: year <dbl>, and abbreviated variable name
# ¹exercise
Another example: select females with high frequency of exercise:
filter(pulse, gender=="female" & exercise=="high")
# A tibble: 3 × 13
id name height weight age gender smokes alcohol exerc…¹ ran pulse1 pulse2
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1993_C Consu… 167 62 18 female no yes high ran 96 176
2 1993_I Sonja 164 56 19 female no yes high sat 68 68
3 1998_G Ursula 155 55 20 female no yes high sat 82 87
# … with 1 more variable: year <dbl>, and abbreviated variable name ¹exercise
Or : ‘|’
The or conditional operator is true when at least one of the conditions is true. For example, list all rows with individuals who drink or smoke or both:
filter(pulse, alcohol=="yes" | smokes=="yes")
# A tibble: 70 × 13
id name height weight age gender smokes alcohol exerc…¹ ran pulse1 pulse2
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1993_A Bonn… 173 57 18 female no yes modera… sat 86 88
2 1993_B Mela… 179 58 19 female no yes modera… ran 82 150
3 1993_C Cons… 167 62 18 female no yes high ran 96 176
4 1993_D Trav… 195 84 18 male no yes high sat 71 73
5 1993_E Lauri 173 64 18 female no yes low sat 90 88
6 1993_F Geor… 184 74 22 male no yes low ran 78 141
7 1993_G Cher… 162 57 20 female no yes modera… sat 68 72
8 1993_H Fran… 169 55 18 female no yes modera… sat 71 77
9 1993_I Sonja 164 56 19 female no yes high sat 68 68
10 1993_J Troy 168 60 23 male no yes modera… ran 88 150
# … with 60 more rows, 1 more variable: year <dbl>, and abbreviated variable name
# ¹exercise
There are 70 observations who at least do one of the two: drink or smoke.
AnswerWhich group is missing from the result above?
Another example, filter rows from years 1995 and 1997
filter(pulse, year==1995 | year==1997)
# A tibble: 45 × 13
id name height weight age gender smokes alcohol exerc…¹ ran pulse1 pulse2
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1995_A Pedro 170 60 18 male no yes modera… sat 62 59
2 1995_B Olga 172 60 21 female no no low sat 81 79
3 1995_C Oscar 189 60 19 male no yes modera… ran 78 168
4 1995_D Kelli 178 56 21 female no yes modera… ran 86 150
5 1995_E Scott 175 75 20 male no yes modera… ran 59 92
6 1995_F Bobby 180 85 19 male yes yes modera… ran 68 125
7 1995_G Laur… 160 57 19 female no no modera… ran 75 130
8 1995_H Eliza 164 66 23 female no no low ran 74 168
9 1995_I Marc 175 65 19 male no yes modera… ran 60 104
10 1995_J Gwen 163 55 20 female no no modera… ran 70 119
# … with 35 more rows, 1 more variable: year <dbl>, and abbreviated variable name
# ¹exercise
Here we have the same variable year in our | construct.
AnswerWhat is wrong with this command: filter(pulse, year==1995 & year==1997) ?
is.na()
You have already seen is.na in action with vectors. It can be used in the context of filter as a logical condition to test for missing value: NA.
Take for example the pulse data, it has missing values in pulse1 and pulse2. We can find the row(s) with missing value for pulse1 with:
filter(pulse, is.na(pulse1))
# A tibble: 1 × 13
id name height weight age gender smokes alcohol exerc…¹ ran pulse1 pulse2
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1997_G Camil… 173 64 20 female no yes modera… sat NA NA
# … with 1 more variable: year <dbl>, and abbreviated variable name ¹exercise
We can see there is a single row with missing value pulse1. What about pulse2?
filter(pulse, is.na(pulse2))
# A tibble: 1 × 13
id name height weight age gender smokes alcohol exerc…¹ ran pulse1 pulse2
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1997_G Camil… 173 64 20 female no yes modera… sat NA NA
# … with 1 more variable: year <dbl>, and abbreviated variable name ¹exercise
Also a single row and it looks like to be the same row as the one with missing value for pulse1
AnswerCan you think of a logical condition that confirms that the missing
pulse1andpulse2values are actually in the same observation?
Equality test with NA, i.e
Negation operator: !
If you are interested only in observation that do not contain missing values, in this case in pulse1 and pulse2:
filter(pulse, ! is.na(pulse1))
# A tibble: 109 × 13
id name height weight age gender smokes alcohol exerc…¹ ran pulse1 pulse2
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1993_A Bonn… 173 57 18 female no yes modera… sat 86 88
2 1993_B Mela… 179 58 19 female no yes modera… ran 82 150
3 1993_C Cons… 167 62 18 female no yes high ran 96 176
4 1993_D Trav… 195 84 18 male no yes high sat 71 73
5 1993_E Lauri 173 64 18 female no yes low sat 90 88
6 1993_F Geor… 184 74 22 male no yes low ran 78 141
7 1993_G Cher… 162 57 20 female no yes modera… sat 68 72
8 1993_H Fran… 169 55 18 female no yes modera… sat 71 77
9 1993_I Sonja 164 56 19 female no yes high sat 68 68
10 1993_J Troy 168 60 23 male no yes modera… ran 88 150
# … with 99 more rows, 1 more variable: year <dbl>, and abbreviated variable name
# ¹exercise
between(x, left, right)
With between function it is possible to filter rows based on variable intervals:
# filter rows with height >=165 and height <= 166
filter(pulse, between(height, 165, 166))
# A tibble: 6 × 13
id name height weight age gender smokes alcohol exerc…¹ ran pulse1 pulse2
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1993_S Nicole 166 56 21 female yes no modera… sat 83 79
2 1995_M Desir… 165 63 18 female no yes modera… sat 84 84
3 1996_R Lucy 166 50 19 female no yes low ran 76 132
4 1997_E Julia… 165 48 19 female no no low sat 83 84
5 1998_C Dona 165 58 23 female no yes low sat 64 68
6 1998_H Haley 165 60 19 female yes yes low ran 88 120
# … with 1 more variable: year <dbl>, and abbreviated variable name ¹exercise
Note that example above can be explicitly written down with & :
filter(pulse, height >= 165 & height <= 166 )
# A tibble: 6 × 13
id name height weight age gender smokes alcohol exerc…¹ ran pulse1 pulse2
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1993_S Nicole 166 56 21 female yes no modera… sat 83 79
2 1995_M Desir… 165 63 18 female no yes modera… sat 84 84
3 1996_R Lucy 166 50 19 female no yes low ran 76 132
4 1997_E Julia… 165 48 19 female no no low sat 83 84
5 1998_C Dona 165 58 23 female no yes low sat 64 68
6 1998_H Haley 165 60 19 female yes yes low ran 88 120
# … with 1 more variable: year <dbl>, and abbreviated variable name ¹exercise
Both are valid solutions.
Copyright © 2023 Biomedical Data Sciences (BDS) | LUMC